---
layout: post
title: "HR Analytics: Predicting Attrition"
date: 2016-05-25
description: "Predicting whether an employee will leave the company"
main-class: 'python'
postbook: 'inotebook'
published: true
color: '#b31917'
tags:
- python
- pandas
- 'HR Analytics'
- "linear regression"
- lasso
- ridge
- boosting
- 'random forest'
- "exploratory analysis"
categories:
- python
introduction: "Predictive analytics using the HR Analytics dataset on Kaggle"
---
Human resources are critical to any organization. Organizations spend a huge amount of time and money to hire and nurture their employees, so it is a significant loss when employees leave, especially key people. If HR can predict which employees are at risk of leaving, it can identify attrition risks early, understand and provide the support needed to retain those employees, or do preventive hiring to minimize the impact on the organization.
This dataset is taken from Kaggle: https://www.kaggle.com/ludobenistant/hr-analytics
Fields in the dataset include:

- satisfaction_level
- last_evaluation
- number_project
- average_montly_hours (spelled this way in the dataset itself)
- time_spend_company
- Work_accident
- left (the target: whether the employee left)
- promotion_last_5years
- sales (the department)
- salary
import pandas as pd
import numpy as np
hr_df = pd.read_csv( 'HR_comma_sep.csv' )
hr_df[0:5]
hr_df.columns
numerical_features = ['satisfaction_level', 'last_evaluation', 'number_project',
'average_montly_hours', 'time_spend_company']
categorical_features = ['Work_accident','promotion_last_5years', 'sales', 'salary']
def create_dummies( df, colname ):
    col_dummies = pd.get_dummies(df[colname], prefix=colname)
    # Drop the first dummy level to avoid the dummy-variable trap
    col_dummies.drop(col_dummies.columns[0], axis=1, inplace=True)
    df = pd.concat([df, col_dummies], axis=1)
    df.drop( colname, axis = 1, inplace = True )
    return df
for c_feature in categorical_features:
    hr_df = create_dummies( hr_df, c_feature )
hr_df[0:5]
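As an aside, recent pandas versions can do the same encoding in a single call via the `drop_first` option of `get_dummies`. A minimal sketch on a toy column (not the HR data):

```python
import pandas as pd

# Toy example: one categorical column, first level dropped to avoid
# the dummy-variable trap (same effect as create_dummies above).
df = pd.DataFrame({'salary': ['low', 'high', 'medium', 'low']})
encoded = pd.get_dummies(df, columns=['salary'], prefix='salary', drop_first=True)
# Remaining columns: salary_low, salary_medium ('high' is the dropped baseline)
```

This replaces the explicit drop-concat-drop dance with one call per DataFrame.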
feature_columns = hr_df.columns.difference( ['left'] )
feature_columns
from sklearn.model_selection import train_test_split  # replaces the deprecated cross_validation module
train_X, test_X, train_y, test_y = train_test_split( hr_df[feature_columns],
hr_df['left'],
test_size = 0.2,
random_state = 42 )
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit( train_X, train_y )
list( zip( feature_columns, logreg.coef_[0] ) )
logreg.intercept_
hr_test_pred = pd.DataFrame( { 'actual': test_y,
'predicted': logreg.predict( test_X ) } )
hr_test_pred = hr_test_pred.reset_index()
hr_test_pred.sample( n = 10 )
from sklearn import metrics
cm = metrics.confusion_matrix( hr_test_pred.actual,
                               hr_test_pred.predicted, labels = [1,0] )
cm
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
sn.heatmap(cm, annot=True, fmt='d', xticklabels = ["Left", "No Left"], yticklabels = ["Left", "No Left"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
score = metrics.accuracy_score( hr_test_pred.actual, hr_test_pred.predicted )
round( float(score), 2 )
Overall test accuracy is 78%, but accuracy is not a good measure here. The score is inflated because the data contains many more employees who stayed than who left, and the model predicts "stayed" for most of them.
The objective of the model is to identify the people who will leave, so that the company can intervene and act.
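A toy illustration (synthetic labels, not the HR data) of how accuracy can look respectable while recall on the leavers is zero:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 90 stayers (0) and 10 leavers (1); a degenerate model predicts
# "stays" for everyone.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)   # 0.9 -- looks decent
rec = recall_score(y_true, y_pred)     # 0.0 -- catches no leaver at all
```

This is why recall on the "left" class, not overall accuracy, is the number to watch.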
This happens because the default decision rule labels an employee as leaving only when the predicted probability of leaving exceeds 0.5.
test_X[:1]
logreg.predict_proba( test_X[:1] )
The model predicts that the probability of this employee leaving is only 0.027, which is very low.
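For a binary LogisticRegression, `predict()` is exactly `predict_proba` thresholded at 0.5, which is why low-probability leavers end up labeled "no left". A sketch on synthetic data, not the HR model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with a noisy linear decision boundary.
rng = np.random.RandomState(42)
X = rng.randn(200, 2)
y = (X[:, 0] + 0.3 * rng.randn(200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
proba_left = model.predict_proba(X)[:, 1]        # P(y = 1)
default_labels = model.predict(X)                # implicit 0.5 cutoff
manual_labels = (proba_left >= 0.5).astype(int)  # the cutoff made explicit
```

Lowering that cutoff trades false positives for more of the true leavers, which is what the ROC analysis below explores.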
predict_proba_df = pd.DataFrame( logreg.predict_proba( test_X ) )
predict_proba_df.head()
hr_test_pred = pd.concat( [hr_test_pred, predict_proba_df], axis = 1 )
hr_test_pred.columns = ['index', 'actual', 'predicted', 'Left_0', 'Left_1']
auc_score = metrics.roc_auc_score( hr_test_pred.actual, hr_test_pred.Left_1 )
round( float( auc_score ), 2 )
sn.distplot( hr_test_pred[hr_test_pred.actual == 1]["Left_1"], color = 'b' )
sn.distplot( hr_test_pred[hr_test_pred.actual == 0]["Left_1"], color = 'g' )
fpr, tpr, thresholds = metrics.roc_curve( hr_test_pred.actual,
hr_test_pred.Left_1,
drop_intermediate = False )
plt.figure(figsize=(6, 4))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
thresholds[0:10]
fpr[0:10]
tpr[0:10]
cutoff_prob = thresholds[(np.abs(tpr - 0.7)).argmin()]
round( float( cutoff_prob ), 2 )
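Targeting a fixed TPR of 0.7 is one way to pick the cutoff; another common rule is Youden's J statistic, which maximizes TPR minus FPR. A small sketch on made-up labels and scores, purely to show the mechanics:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and scores, not the HR model's output.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.5, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, scores)
best_cutoff = thresholds[np.argmax(tpr - fpr)]  # Youden's J statistic
```

The chosen point is the ROC corner farthest above the diagonal, so it needs no hand-picked TPR target.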
hr_test_pred['new_labels'] = hr_test_pred['Left_1'].map( lambda x: 1 if x >= cutoff_prob else 0 )
hr_test_pred[0:10]
cm = metrics.confusion_matrix( hr_test_pred.actual,
                               hr_test_pred.new_labels, labels = [1,0] )
sn.heatmap(cm, annot=True, fmt='d', xticklabels = ["Left", "No Left"], yticklabels = ["Left", "No Left"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import GridSearchCV  # replaces the deprecated grid_search module
param_grid = {'max_depth': np.arange(3, 10)}
tree = GridSearchCV(DecisionTreeClassifier(), param_grid, cv = 10)
tree.fit( train_X, train_y )
tree.best_params_
tree.best_score_
clf_tree = DecisionTreeClassifier( max_depth = 9 )
clf_tree.fit( train_X, train_y )
Wow! The cross-validation accuracy is about 98%.
tree_test_pred = pd.DataFrame( { 'actual': test_y,
'predicted': clf_tree.predict( test_X ) } )
tree_test_pred.sample( n = 10 )
metrics.accuracy_score( tree_test_pred.actual, tree_test_pred.predicted )
tree_cm = metrics.confusion_matrix( tree_test_pred.actual,
                                    tree_test_pred.predicted,
                                    labels = [1,0] )
sn.heatmap(tree_cm, annot=True,
           fmt='d',
           xticklabels = ["Left", "No Left"], yticklabels = ["Left", "No Left"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
export_graphviz( clf_tree,
                 out_file = "hr_tree.dot",
                 feature_names = train_X.columns )
import pydotplus as pdot
hr_tree_graph = pdot.graph_from_dot_file( 'hr_tree.dot' )
hr_tree_graph.write_jpg( 'hr_tree.jpg' )
from IPython.display import Image
Image(filename='hr_tree.jpg')
from sklearn.ensemble import RandomForestClassifier
radm_clf = RandomForestClassifier()
radm_clf.fit( train_X, train_y )
radm_test_pred = pd.DataFrame( { 'actual': test_y,
'predicted': radm_clf.predict( test_X ) } )
metrics.accuracy_score( radm_test_pred.actual, radm_test_pred.predicted )
radm_cm = metrics.confusion_matrix( radm_test_pred.actual,
                                    radm_test_pred.predicted,
                                    labels = [1,0] )
sn.heatmap(radm_cm, annot=True,
           fmt='d',
           xticklabels = ["Left", "No Left"], yticklabels = ["Left", "No Left"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
indices = np.argsort(radm_clf.feature_importances_)[::-1]
feature_rank = pd.DataFrame( columns = ['rank', 'feature', 'importance'] )
for f in range(train_X.shape[1]):
    feature_rank.loc[f] = [f+1,
                           train_X.columns[indices[f]],
                           radm_clf.feature_importances_[indices[f]]]
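The same ranking can be built more compactly with a pandas Series. A sketch on hypothetical importances and feature names, not the fitted forest:

```python
import numpy as np
import pandas as pd

# Hypothetical importances for four features, just to show the idiom.
importances = np.array([0.10, 0.55, 0.15, 0.20])
columns = ['satisfaction', 'evaluation', 'projects', 'hours']
ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
# ranked.index[0] is the most important feature
```

The Series can be fed straight into the barplot without the explicit loop.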
sn.barplot( y = 'feature', x = 'importance', data = feature_rank )
As per the model, the most important features influencing whether an employee leaves the company, in descending order, are shown in the importance plot above.